Owing to their wide range of applications, spreading processes on complex networks have been intensively studied. 
However, one of the most fundamental problems has not yet been well addressed: predicting the evolution of 
spreading based on a given snapshot of the propagation on networks. With this problem solved, one can 
accelerate or slow down the spreading in advance if the predicted propagation result is narrower or wider 
than expected. In this paper, we propose an iterative algorithm to estimate the infection probability of the 
spreading process and then apply it to a mean-field approach to predict the spreading coverage. The 
validation of the method is performed in both artificial and real networks. The results show that our method 
is accurate in both infection probability estimation and spreading coverage prediction. 

Many complex systems can be characterized by networks in which nodes represent the individuals and 
edges represent the interactions. Examples include citation networks 1,2, communication networks 3, 
transportation networks 4, cyber networks 5, and financial networks 6, to name just a few. The study of complex 
networks has therefore become a common focus of many branches of science. So far, great efforts have been made 
to understand and predict the evolution of networks. For instance, link prediction intends to identify which pair 
of nodes will be connected in the future 7,8 . Trend prediction aims at predicting the future degree of nodes 2,9 . 
However, most of the related works focus on the structural aspect of networks. Even though dynamical processes 
commonly take place in real networks 10 , the prediction of the evolution of dynamics on networks has been 
seriously overlooked. 

Spreading is an important kind of dynamics which has been applied to model many real processes on networks, 
such as the spreading of disease 11-14, the propagation of news and rumors 15-18, and the cascading failure of power grids 19. 
In this paper, we focus on predicting the evolution of spreading. Solving this problem is very meaningful from a 
practical point of view. In the context of disease spreading, one can immunize nodes and links in advance to 
prevent the virus from covering the whole network if the predicted coverage of the spreading is very wide 13,20-24. 
On the other hand, the propagation of some important information can be accelerated by adding more spreading 
seeds beforehand if the predicted coverage of the propagation is very narrow 25-30. 

In the cases where prediction is needed, the known information about the spreading process is usually very limited, 
especially in the early stage of the spreading 31,32. Similar to ref. 33, we assume in this paper that only a snapshot 
of the spreading result is given. In the literature, the prediction of spreading is mostly based on time series 
analysis 34. The closest studies based on a spreading snapshot are refs. 35, 36, where the observed snapshot is used to 
identify the initial spreader of a certain disease or piece of information. In the prediction of spreading, the essential 
problem is how to accurately estimate the infection probability from the observed snapshot. One can consider the 
most straightforward method, in which the infection probability is estimated based on each infected node $i$ as 
$\mu_i = m_i/M_i$, where $m_i$ and $M_i$ are respectively the number of infected neighbors and the total number of neighbors of node $i$. By averaging $\mu_i$ 
over all the infected nodes in the network, one can estimate the infection probability of the spreading. This 
method is referred to as the "benchmark" method in this paper. However, the benchmark method may lead to 
serious overestimation of the infection probability. As this method does not distinguish which node spread the 
virus to a given infected node, each infected node may be used more than once in $\mu_i = m_i/M_i$ for different $i$ (see the 
illustration in Fig. 1). 
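
As an illustration of the benchmark estimate described above, a minimal Python sketch follows. The adjacency-dictionary layout, the 'S'/'I'/'R' labels, the function name, and the choice to average over recovered nodes (the nodes that have already had their chance to spread, as suggested by Fig. 1) are assumptions made for this example and are not specified in the text.

```python
def benchmark_estimate(adj, state):
    """Average of mu_i = m_i / M_i over nodes that have already spread.

    adj   : dict mapping node -> list of neighbours (undirected graph)
    state : dict mapping node -> 'S', 'I' or 'R'
    Assumption of this sketch: the average runs over recovered (R) nodes,
    and a neighbour counts as infected if it is labelled 'I' or 'R'.
    """
    ratios = []
    for node, neighbours in adj.items():
        if state[node] != "R" or not neighbours:
            continue
        m_i = sum(1 for v in neighbours if state[v] in ("I", "R"))
        ratios.append(m_i / len(neighbours))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical toy snapshot: the two R nodes 'a' and 'b' share the same
# two I neighbours 'c' and 'd', so both I nodes are counted twice.
toy_adj = {
    "a": ["c", "d", "e"], "b": ["c", "d", "f"],
    "c": ["a", "b"], "d": ["a", "b"], "e": ["a"], "f": ["b"],
}
toy_state = {"a": "R", "b": "R", "c": "I", "d": "I", "e": "S", "f": "S"}
toy_estimate = benchmark_estimate(toy_adj, toy_state)
print(toy_estimate)
```

In this toy case each R node sees two infected neighbours out of three, although at most one of the two R nodes can have infected each I node; this is the double counting that the text identifies as the source of the overestimation.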

To solve this problem, we develop an iterative algorithm for estimating the infection probability (IAIP for 
short) in which the problem of multiple use of the infected nodes is avoided. We validate the IAIP by simulating 
the Susceptible-Infected-Removed (SIR) model 37 in both artificial and real networks. The results show that our 
method can significantly outperform the benchmark method. Moreover, we study the case in which the iterative 
process is removed from our method (denoted as IAIP$_0$). The results show that IAIP$_0$ performs much less 
effectively than IAIP, indicating the crucial role of the iterative process. When the obtained infection 
probability is used in predicting the future spreading coverage, a much more accurate prediction can 
be achieved by using IAIP. 

Figure 1 | A snapshot of the spreading result in a toy network. The blue 
nodes are susceptible (marked by S), yellow nodes are infected (marked by 
I) and pink nodes are recovered (marked by R). Since each node inside the 
shaded ellipse is connected to two R nodes, one cannot distinguish which R 
node actually infected these two I nodes in the previous step. In the 
benchmark method, these two I nodes will be used when estimating the 
infection probability based on each R node, which finally leads to an 
overestimation of the infection probability. 

Results 

We consider a network with N nodes and E links. The network is 
represented by an adjacency matrix $A$, where $A_{ij} = 1$ if there is a link 
between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. To simulate the spreading 
process on networks, we employ the Susceptible-Infected-Removed (SIR) 
model 37. In a network, we randomly select one node 
as the initial spreader. The virus from this node will infect each of this 
node's susceptible neighbors with probability $\mu$, namely the infection 
probability. After infecting neighbors, the node will immediately 
become recovered (i.e., the recovering probability is 1). The newly 
infected nodes will, in the next step, infect their neighbors in the same way as the initial 
node. The spreading ends when there are no more infected 
nodes in the network. Unless otherwise stated, we take the snapshot 
after five steps of spreading from the initial node as the known 
information. 
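
The spreading rule just described can be sketched in a few lines of Python. This is only an illustrative implementation under stated assumptions: the graph is an adjacency dictionary, node states are the labels 'S', 'I' and 'R', and the function name and the `rng` argument are hypothetical choices made here.

```python
import random


def sir_snapshot(adj, seed_node, mu, steps, rng=None):
    """Run the SIR process with recovery probability 1 and return a snapshot.

    adj       : dict mapping node -> list of neighbours
    seed_node : the initial spreader
    mu        : infection probability per susceptible neighbour
    steps     : number of spreading steps before the snapshot is taken
    """
    rng = rng or random.Random()
    state = {v: "S" for v in adj}
    state[seed_node] = "I"
    infected = [seed_node]
    for _ in range(steps):
        if not infected:          # the spreading has already ended
            break
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                if (state[v] == "S" and v not in newly_infected
                        and rng.random() < mu):
                    newly_infected.add(v)
        for u in infected:        # every spreader recovers after one step
            state[u] = "R"
        for v in newly_infected:
            state[v] = "I"
        infected = list(newly_infected)
    return state


# Example: snapshot after five steps on a small ring of 10 nodes.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
snapshot = sir_snapshot(ring, seed_node=0, mu=0.5, steps=5,
                        rng=random.Random(1))
```

Passing a seeded `random.Random` instance makes a realization reproducible, which is convenient when the same snapshot is to be fed to several estimators.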

Epidemic spreading is a stochastic process. Given an infection 
probability and an initial infected node, the spreading results can 
vary significantly in different realizations. An observed snapshot 
may correspond to many different $\mu$ values. Therefore, one 
cannot use deterministic models to exactly infer the $\mu$ value from 
the spreading snapshot. In this paper, we propose an iterative 
method to infer the $\mu$ value. Though the inference is not exact, we 
will show below that the expected value of the obtained $\mu$ is very close to 
the real infection probability, with a relatively small dispersion. 

We first test the IAIP (see the Method section for description) in 
artificial networks: Watts-Strogatz (WS) networks 38 and Barabási-Albert 
(BA) networks 39. In Fig. 2(a) and (b), we show the estimated 
infection probability $\mu_e$ from the benchmark, IAIP$_0$ and IAIP methods 
as a function of the true infection probability $\mu_r$. Obviously, if a 
method can accurately estimate the infection probability, the curve of 
this method in Fig. 2(a) and (b) should overlap well with $\mu_e = \mu_r$. One 
can immediately notice that the curve of the IAIP lies around $\mu_e = \mu_r$, 
while the curve of the benchmark method is significantly higher 
than that, indicating a serious overestimation of the infection probability 
in the benchmark method. Moreover, without the iterative 
process the curve of the IAIP$_0$ is lower than $\mu_e = \mu_r$. In Fig. 2(c) and 
(d), we fix an infection probability and investigate the disparity of $\mu_e$ 
from the IAIP under different choices of initial spreader (each node is 
selected once as the initial spreader). The distribution of $\mu_e$ is rather 
narrow with $\langle \mu_e \rangle \approx \mu_r$, indicating the stable performance of the IAIP. 
Moreover, the deviation of $\mu_e$ is much smaller in BA networks than 
in WS networks. We thus conclude that IAIP performs more 
stably in networks with heterogeneous degree distributions, which 
are widely observed in real systems. 

Figure 2 | The estimated infection probability $\mu_e$ from different methods as a function of the true infection probability $\mu_r$ in (a) WS and (b) BA networks. 
$\mu_e = \mu_r$ is plotted with dashed lines to guide the eye. The distributions of $\mu_e$ in (c) WS and (d) BA networks are shown. The error bars in (a)(b) and the 
distributions in (c)(d) are obtained by estimating the infection probability $\mu_e$ under 100 spreading realizations from each node in the network. The 
network parameters are $N = 4000$, $\langle k \rangle = 10$, $p = 0.4$ for WS networks, and $N = 4000$, $\langle k \rangle = 10$ for BA networks. 

In order to quantify the accuracy of the infection probability 
estimation, we define an error rate metric as $\delta = |\mu_e - \mu_r| / \mu_r$. 
According to the definition, a smaller $\delta$ indicates a more accurate 
estimation. We then investigate how the network topology affects the 
value of $\delta$. For WS networks, we study the effect of the rewiring 
parameter $p$ on $\delta$. For BA networks, we consider a variant in 
which each new node connects to an existing node $i$ with probability 
$P_i = (k_i + B)/\sum_j (k_j + B)$ 40,41. This modified model allows a selection 
of the exponent of the power-law scaling in the degree distribution 
$p(k) \sim k^{-\gamma}$, with $\gamma = 3 + B/m$ in the thermodynamic limit. With this 
network, we study the effect of $B$ on $\delta$. Related results on the WS and 
the modified BA networks are shown in Fig. 3. By comparing 
Fig. 3(a)(b)(c), one can immediately see that when $p$ is small, $\delta$ of 
IAIP can be approximately 10 times smaller than that of the benchmark 
method and 3 times smaller than that of IAIP$_0$. Though $\delta$ in 
both methods decreases with $p$, this effect is much stronger in the 
benchmark method. The local clustering of the WS network is 
destroyed when $p$ is large, which makes the infected nodes adjacent to 
each other less frequently. The problem of multiple use of the 
infected nodes in $\mu_i = m_i/M_i$ accordingly becomes less serious in the benchmark 
method. However, note that in real social networks the 
clustering coefficient is usually very high, which indicates a low 
accuracy of the benchmark method in real applications. 
Fig. 3(d)(e)(f) show the results of the benchmark, IAIP$_0$ and IAIP methods 
on the modified BA networks. One can see that IAIP still enjoys 
the smallest $\delta$. Moreover, $\delta$ of the benchmark method decreases with 
$B$ in the modified BA networks. On the contrary, the performance of 
the IAIP method does not strongly depend on the network structure, 
indicating the high reliability of the IAIP method. 
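
For readers who wish to reproduce the modified BA networks, the shifted preferential attachment rule (attachment probability proportional to $k_i + B$) can be sketched as follows. The fully connected seed of $m + 1$ nodes, the requirement $B > -m$, the function name and the simple quadratic-time sampling are assumptions of this example; the text does not specify these implementation details.

```python
import random


def modified_ba(n, m, B, rng=None):
    """Grow a network in which a new node attaches to an existing node i
    with probability proportional to k_i + B.

    n : final number of nodes, m : links added per new node,
    B : attachment shift (assumed B > -m so that all weights stay positive).
    Returns an adjacency dict mapping node -> set of neighbours.
    """
    if n < m + 1 or B <= -m:
        raise ValueError("need n >= m + 1 and B > -m")
    rng = rng or random.Random()
    # Assumed seed: a complete graph on m + 1 nodes.
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    for new in range(m + 1, n):
        nodes = list(adj)
        weights = [len(adj[v]) + B for v in nodes]
        targets = set()
        while len(targets) < m:   # draw until m distinct targets are found
            targets.add(rng.choices(nodes, weights=weights, k=1)[0])
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
    return adj


# Example: a small network with m = 5, i.e. average degree close to 10.
net = modified_ba(n=200, m=5, B=2.0, rng=random.Random(0))
```

Setting `B = 0` recovers the standard linear preferential attachment rule, so the same routine can serve for the unmodified BA case.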

In all the analysis above, we consider the spreading results at $t = 5$ 
as the observed snapshot. As in real cases the snapshot at hand may 
be from a different spreading stage, it is interesting to study 
the relation between $\delta$ and $t$. In Fig. 4, we report the dependence of $\delta$ 
on $t$. Fig. 4(a) and (b) are the results of the IAIP in WS and BA 
networks, respectively. One can see that there is an optimal $\delta$ when 
tuning $t$. In order to understand this phenomenon, we show the 
number of infected nodes $N_I$ versus the spreading step $t$ in 
Fig. 4(c) and (d). Consistent with previous results in the literature, 
we observe here that $N_I$ first increases and then decreases with $t$. 
Interestingly, the optimal $t^*$ for $\delta$ is the same as the $t$ where $N_I$ 
achieves its maximum. When $t$ is large, $N_I$ is very small and the 
spreading is more or less at its final stage. In this situation $\delta$ of 
IAIP is relatively high. However, this is not a problem in practice 
since usually we only need to predict the future spreading coverage 
when $t$ is small. We also check the dependence of $\delta$ on $t$ in the 
benchmark method. We observe that $\delta$ quickly increases with $t$. 
This is because the risk of overestimation of the infection probability 
becomes more serious when the virus covers a large part of the 
network. 

We further test the IAIP method in some large-scale real networks. 
Both undirected and directed networks are considered: Cond-mat 
(undirected scientific collaboration network) 42, Youtube (undirected 
online user friendship network) 43, EmailEU (directed email communication 
network) 44, and Delicious (directed online user friendship 
network) 45. In Cond-mat, EmailEU and Delicious, the real infection 
probability is set as $\mu_r = 0.2$, and in Youtube, it is set as $\mu_r = 0.05$. In 
each realization, we randomly pick a node from the network as the initial spreader and 
apply the benchmark, IAIP$_0$ and IAIP methods to the spreading 
snapshot at $t = 5$. We calculate the error rate $\delta$ after $\mu_e$ is 
obtained. The mean error rate $\langle \delta \rangle$ of each method is finally obtained 
by randomly selecting 1000 initial nodes and simulating 100 spreading 
realizations from each of these initial nodes. Results on the real 
networks are reported in Table 1. Consistent with the results in artificial 
networks, the IAIP method enjoys a much smaller error rate 
than the IAIP$_0$ and benchmark methods. 

Figure 3 | The dependence of the error rate $\delta$ on $p$ in WS networks for the (a) benchmark, (b) IAIP$_0$ and (c) IAIP methods. The dependence of 
the error rate $\delta$ on $B$ in the modified BA networks for the (d) benchmark, (e) IAIP$_0$ and (f) IAIP methods. In all artificial networks, the network size is 
$N = 4000$ and the average degree is $\langle k \rangle = 10$. The mean values and error bars are obtained by estimating the infection probability $\mu_e$ under 100 spreading 
realizations from each node in the network. 

Figure 4 | The dependence of the estimation accuracy $\delta$ of the IAIP on the time of the snapshot $t$ in (a) WS and (b) BA networks. (c) and (d) are 
respectively the number of infected nodes $N_I$ versus the spreading step $t$ in WS and BA networks. The network parameters are the same as those in Fig. 2. 
The mean values and error bars are obtained by simulating 100 spreading realizations from each node in the network. 

Accurately estimating $\mu$ can lead to many applications; here we are 
mainly interested in predicting the spreading coverage based on 
$\mu_e$. At the mean-field level, the dynamics of the SIR model in complex 
networks can be described by differential equations as 46 

$$
\begin{aligned}
\frac{dS_k(t)}{dt} &= -\mu k S_k(t) \Theta(t), \\
\frac{dI_k(t)}{dt} &= -I_k(t) + \mu k S_k(t) \Theta(t), \\
\frac{dR_k(t)}{dt} &= I_k(t),
\end{aligned}
\qquad (1)
$$

where $S_k(t)$, $I_k(t)$ and $R_k(t)$ are the density of susceptible, infected, and 
removed nodes of degree $k$ at time $t$, respectively. According to the 
definition, $S_k(t) + I_k(t) + R_k(t) = 1$. The factor $\Theta(t)$ represents the 
probability that any given link points to an infected node and is given 
by 

$$
\Theta(t) = \frac{\sum_k k P(k) I_k(t)}{\langle k \rangle}, \qquad (2)
$$

where $P(k)$ is the degree distribution and $\langle k \rangle$ is the average degree of 
the network. In order to predict the coverage at time $t + 1$, one can 
follow 

$$
\begin{aligned}
N_S(t+1) &= N_S(t) + dN_S(t) = N_S(t) - N \mu \sum_k k P(k) S_k(t) \Theta(t), \\
N_I(t+1) &= N_I(t) + dN_I(t) = N \mu \sum_k k P(k) S_k(t) \Theta(t), \\
N_R(t+1) &= N_R(t) + dN_R(t) = N_R(t) + N_I(t).
\end{aligned}
\qquad (3)
$$

Equation (3) can be iteratively used to predict the spreading 
coverage in the longer term, namely at $t + m$. We refer to this method as 
the mean-field (MF) prediction method. From equation (3), one 
can see that the essential parameter determining the prediction accuracy 
is $\mu$. We thus study the prediction accuracy when $\mu_e$ of the 
benchmark, IAIP$_0$ and IAIP methods is used. The results in Fig. 5 
show that the mean-field predictors with both the IAIP$_0$ and IAIP methods 
are close to the true evolution. 

Table 1 Basic structural properties (network size $N$, edge number $E$) and the mean error 

Figure 5 | The predicted and real evolution of $N_I + N_R$ in (a) WS and (b) BA networks. The real infection probability is set as $\mu_r = 0.15$. $m$ is the 
number of spreading steps after the observed snapshot. The network parameters are the same as those in Fig. 2. The mean values and error bars are 
obtained by predicting the future spreading coverage under 100 spreading realizations from each node in the network. 
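
To make the mean-field prediction step concrete, the following Python sketch iterates a discrete-time version of the degree-based update with a unit time step. The densities $S_k$, $I_k$, $R_k$ are measured from the snapshot, and the predicted coverage $N_I + N_R$ is returned for each of the next $m$ steps. The data layout, the function name, and the clipping of new infections to the available susceptible density are assumptions of this example rather than details given in the text.

```python
from collections import Counter


def mean_field_coverage(adj, state, mu, m_steps):
    """Predict N_I + N_R for 1, ..., m_steps steps after the snapshot.

    adj   : dict mapping node -> list of neighbours
    state : dict mapping node -> 'S', 'I' or 'R' (the observed snapshot)
    mu    : (estimated) infection probability
    """
    n = len(adj)
    degree = {v: len(adj[v]) for v in adj}
    n_k = Counter(degree.values())        # number of nodes of each degree
    mean_k = sum(degree.values()) / n     # average degree <k>
    if mean_k == 0:
        raise ValueError("the network has no links")

    # Densities S_k, I_k, R_k of each degree class in the snapshot.
    dens = {label: {k: 0.0 for k in n_k} for label in "SIR"}
    for v, label in state.items():
        dens[label][degree[v]] += 1.0 / n_k[degree[v]]
    S, I, R = dens["S"], dens["I"], dens["R"]

    coverage = []
    for _ in range(m_steps):
        # Theta: probability that a link points to an infected node.
        theta = sum(k * (n_k[k] / n) * I[k] for k in n_k) / mean_k
        for k in n_k:
            # Clipping to S_k is a guard added in this sketch.
            new_inf = min(S[k], mu * k * S[k] * theta)
            R[k] += I[k]      # all currently infected nodes recover
            S[k] -= new_inf
            I[k] = new_inf
        coverage.append(sum(n_k[k] * (I[k] + R[k]) for k in n_k))
    return coverage
```

Because only degree-class densities enter the update, the prediction depends on the snapshot solely through how many nodes of each degree are susceptible, infected or recovered, not on which particular nodes they are.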

Besides the mean-field model, we have considered some more 
realistic models, such as the pair approximation model 47-49 and the 
moment closure approximation model 50. The main difference 
between the mean-field and pair approximations is that the former 
(latter) approximates high-order moments in terms of first (second) 
order ones. The moment closure approximation can incorporate 
the structure of the network into the model and allows for the 
definition of triples in terms of pairs. We applied the estimated $\mu$ 
value to the pair approximation models 47,48, and found results consistent 
with the mean-field case, i.e., the prediction based on the IAIP$_0$ 
and IAIP methods is very close to the true evolution. 

Discussion 

Prediction in complex networks has always been an important 
research topic. Though much related research has been done, 
most of it focuses on structural aspects such as link prediction 
and node popularity prediction. The problems of estimating the infection 
probability from a given spreading snapshot and accordingly 
predicting the spreading results are very important, with many 
potential applications in real systems. However, little has been done 
in this research direction. In this paper, we first design an iterative 
algorithm to estimate the infection probability from an 
observed spreading snapshot. The simulation in both artificial and 
real networks shows that our method achieves a high accuracy in 
estimating the infection probability. Finally, the estimated 
infection probability is applied to a mean-field method for predicting 
the evolution of the spreading coverage. 

In this paper, we consider the basic SIR model in which the recovery 
probability is set as $\beta = 1$, so the infectious period is one time step. 
We also investigate the more complicated case where $\beta < 1$. Our 
model cannot be directly applied to estimate the parameter $\beta$. 
However, in this case the $\mu$ value obtained from our method 
actually corresponds to the effective infection probability, i.e. 
$\mu_{\mathrm{eff}} = \mu/\beta$. We observe that the estimation of $\mu_{\mathrm{eff}}$ becomes less accurate 
when $\beta$ is smaller. In fact, the situation of $\beta < 1$ is very complicated 
and requires some new method that directly estimates $\mu$ and $\beta$. 
Related research in this direction is an interesting extension for the 
future. 

Some issues remain open. In this paper, we focus on the 
SIR model; it would be interesting to examine the proposed iterative 
method in other spreading models such as SI and SIS. Moreover, 
the mean-field prediction method in this paper can only predict the 
width of the spreading. A more interesting and important issue 
would be predicting which nodes will be infected in the future. 
Besides spreading, there are many other dynamical processes on 
networks such as synchronization and percolation 51,52. We hope 
the method and results in this paper can inspire some prediction 
methods for other dynamical processes. 

Methods 

We now describe the iterative algorithm for estimating the infection probability (the 
IAIP method). In a snapshot of the spreading results, we denote the number of 
infected nodes as $N_I$, the number of susceptible nodes as $N_S$ and the number of 
recovered nodes as $N_R$. According to the definition, $N_S + N_I + N_R = N$. The infection 
probability can be calculated as 

$$
\mu = \frac{N_R + N_I}{\sum_{i \in R} (k_i - m_i)}, \qquad (4)
$$

where $k_i$ is the degree of node $i$ and $m_i$ is the number of already infected nodes (both I and R nodes) among $i$'s 
neighbors when $i$ tries to infect other nodes. 

Apparently, the exact value of $m_i$ cannot be directly extracted from the snapshot. 
One can estimate $m_i$ by its expected value 



1X^=0= I>^£ 



EZi/<-(i-m) 5 



l-(l-rt M ' 



(5) 



where $M_i$ is the total number of I and R nodes among $i$'s neighbors in the observed 
snapshot. 

In the equations above, one can see that $\mu$ and $m_i$ depend on each other. They are 
expected to respectively approach their true values during the iterations. In the 
simulation, we set the initial $m_i = 0$, such that $\mu_0 = (N_R + N_I) / \sum_{i \in R} k_i$. Eqs. (4) 
and (5) are then iterated until the change of the difference 



(6) 



in two successive steps is less than a small threshold of $10^{-4}$. 

In this paper, we also consider the performance of the above method without the 
iterative process, denoted as the IAIP$_0$ method. It simply calculates $\mu$ by eq. (4) 
without updating $m_i$ from eq. (5), i.e. directly setting $m_i = 0$. 
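
The overall structure of the iteration (start from $m_i = 0$, compute $\mu$ from eq. (4), update every $m_i$ from its expected value, and repeat until $\mu$ stabilizes) can be sketched in Python as below. Two parts of this sketch are assumptions and should not be read as the exact procedure of this paper: the function `expected_m` is a stand-in that treats $m_i$ as the mean of a truncated geometric distribution over the $M_i$ infected neighbors, and the stopping rule compares successive $\mu$ values directly. The function names, the data layout and the cap of $\mu$ at 1 are likewise choices made for this example.

```python
def expected_m(mu, M):
    """ASSUMED stand-in for the expected-value step: mean position of the
    first successful infector among M infected neighbours, given that at
    least one of them succeeded. Replace with the exact expression when
    available."""
    if M == 0:
        return 0.0
    numerator = sum(j * mu * (1.0 - mu) ** (j - 1) for j in range(1, M + 1))
    return numerator / (1.0 - (1.0 - mu) ** M)


def iaip_estimate(adj, state, iterate=True, tol=1e-4, max_iter=1000):
    """Estimate the infection probability from a snapshot.

    adj   : dict mapping node -> list of neighbours
    state : dict mapping node -> 'S', 'I' or 'R'
    With iterate=False, m_i stays at 0 (the non-iterative variant).
    """
    covered = ("I", "R")
    recovered = [v for v in adj if state[v] == "R"]
    if not recovered:
        raise ValueError("the snapshot contains no recovered node")
    n_cov = sum(1 for v in adj if state[v] in covered)      # N_R + N_I
    k = {v: len(adj[v]) for v in recovered}                 # degrees k_i
    M = {v: sum(1 for u in adj[v] if state[u] in covered)
         for v in recovered}                                # M_i

    def eq4(m):
        denom = sum(k[v] - m[v] for v in recovered)
        # Capping at 1 is a guard added in this sketch.
        return min(1.0, n_cov / denom) if denom > 0 else 1.0

    mu = eq4({v: 0.0 for v in recovered})                   # mu_0
    if not iterate:
        return mu
    for _ in range(max_iter):
        m = {v: expected_m(mu, M[v]) for v in recovered}
        new_mu = eq4(m)
        if abs(new_mu - mu) < tol:   # assumed stopping rule
            return new_mu
        mu = new_mu
    return mu
```

Because the expected-value step is passed through a single helper, a different expression for $m_i$ can be substituted without touching the rest of the loop.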